9 research outputs found

    Rates of DNA Sequence Profiles for Practical Values of Read Lengths

    Full text link
    A recent study by one of the authors has demonstrated the importance of profile vectors in DNA-based data storage. We provide exact values and lower bounds on the number of profile vectors for finite values of alphabet size $q$, read length $\ell$, and word length $n$. Consequently, we demonstrate that for $q \ge 2$ and $n \le q^{\ell/2-1}$, the number of profile vectors is at least $q^{\kappa n}$ with $\kappa$ very close to one. In addition to enumeration results, we provide a set of efficient encoding and decoding algorithms for each of two particular families of profile vectors
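The abstract does not define a profile vector. The sketch below assumes the usual convention in which the profile vector of a word records, for every length-$\ell$ string over the alphabet, how often it occurs as a contiguous substring. The function names are hypothetical, and the brute-force count is only feasible for very small $q$, $n$, and $\ell$; it is not the enumeration method of the paper.

```python
from collections import Counter
from itertools import product


def profile_vector(word, alphabet, ell):
    """Return the counts of every length-ell string as a contiguous substring of word.

    Assumed definition (not stated in the abstract): entries are indexed by
    all length-ell strings over `alphabet`, in lexicographic product order.
    """
    counts = Counter(word[i:i + ell] for i in range(len(word) - ell + 1))
    return tuple(counts["".join(gram)] for gram in product(alphabet, repeat=ell))


def count_profile_vectors(alphabet, n, ell):
    """Brute-force count of distinct profile vectors over all words of length n."""
    words = ("".join(w) for w in product(alphabet, repeat=n))
    return len({profile_vector(w, alphabet, ell) for w in words})


if __name__ == "__main__":
    print(profile_vector("0010", "01", 2))
    print(count_profile_vectors("01", 3, 2))
```

Distinct words can share a profile vector (for example `010` and `101` with $\ell = 2$), which is why the count can fall below $q^n$.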

    Sequences and their applications

    No full text
    Binary and q-ary sequences have long been used in communication channels as carriers of information. Establishing an efficient and error-free communication channel requires investigating the properties of such sequences. This dissertation studies properties of sequences that can help establish more reliable communication channels: the linear complexity of sequences and the reconstruction of a sequence from its subsequences. The linear complexity of a binary sequence is defined as the length of the shortest linear feedback shift register that generates it. In the first part of this dissertation, we devise a novel and efficient algorithm to find the linear complexity of any binary sequence. This algorithm generalizes the well-known Games-Chan algorithm. Furthermore, it runs in linear time and is faster than previous well-known algorithms for certain parameters of the length of periodic sequences. The second property that we investigate is the reconstruction of binary sequences from their subsequences, known as the sequence reconstruction problem. The problem considers a communication scenario where the sender transmits a sequence from some codebook and the receiver obtains multiple noisy reads, that is, possibly erroneous copies, of the original sequence. The receiver (decoder) then aims to reconstruct the original sequence from these noisy reads. The only errors considered in this dissertation are deletions, where some elements of the original sequence are deleted, so the noisy reads are subsequences of the original sequence. We consider two variants of the problem.
Firstly, we assume that the decoder receives all subsequences of a certain fixed length of the original sequence, with multiplicity. In other words, the decoder obtains the profile of subsequences of the original sequence, namely the k-deck of the binary sequence. The k-deck of a sequence is defined as the multiset of all its subsequences of length k. We determine the exact number of distinct k-decks over all binary sequences of the same length for small values of k and provide asymptotic estimates of this number when k is fixed. Specifically, we introduce a trellis-based method to compute this number for fixed k in polynomial time. In the second variant, the decoder does not receive every possible subsequence of the original sequence, but only a fixed number of noisy reads, that is, a fixed number of subsequences. In this case, we also assume that all received noisy reads are distinct. We construct codes that are capable of correcting t deletions with multiple noisy reads. Special attention is given to the cases t=1 and t=2. Doctor of Philosoph
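To make the k-deck definition concrete, here is a minimal sketch that builds the multiset of length-k subsequences and counts distinct k-decks by exhaustive search. The function names are hypothetical, and the exhaustive count is exponential in the sequence length; it is not the trellis-based method mentioned in the abstract.

```python
from collections import Counter
from itertools import combinations, product


def k_deck(seq, k):
    """Return the multiset of all length-k (not necessarily contiguous) subsequences."""
    return Counter("".join(chars) for chars in combinations(seq, k))


def count_distinct_k_decks(n, k):
    """Brute-force count of distinct k-decks over all binary sequences of length n."""
    decks = set()
    for bits in product("01", repeat=n):
        deck = k_deck("".join(bits), k)
        decks.add(frozenset(deck.items()))  # hashable form of the multiset
    return len(decks)


if __name__ == "__main__":
    print(k_deck("0110", 2))
    print(count_distinct_k_decks(3, 1), count_distinct_k_decks(3, 2))
```

For k = 1 the deck only records how many zeros and ones occur, so many sequences share a deck; larger k separates more sequences.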

    Correcting deletions with multiple reads

    No full text
    The sequence reconstruction problem, introduced by Levenshtein in 2001, considers a communication scenario where the sender transmits a codeword from some codebook and the receiver obtains multiple noisy reads of the codeword. Motivated by modern storage devices, we introduced a variant of the problem where the number of noisy reads $N$ is fixed. Of significance, for the single-deletion channel, using $\log_2\log_2 n + O(1)$ redundant bits, we designed a reconstruction code of length $n$ that reconstructs codewords from two distinct noisy reads (Cai et al., 2021). In this work, we show that $\log_2\log_2 n - O(1)$ redundant bits are necessary for such reconstruction codes, thereby demonstrating the optimality of the construction. Furthermore, we show that these reconstruction codes can be used in $t$-deletion channels (with $t \geqslant 2$) to uniquely reconstruct codewords from $n^{t-1}/(t-1)! + O(n^{t-2})$ distinct noisy reads. For the two-deletion channel, using higher order VT syndromes and certain runlength constraints, we designed the class of higher order constrained shifted VT code with $2\log_2 n + o(\log_2 n)$ redundancy bits that can reconstruct any codeword from any $N \ge 5$ of its length-$(n-2)$ subsequences. Ministry of Education (MOE). The work of Han Mao Kiah was supported by the Ministry of Education, Singapore, under its MOE AcRF Tier 2 Award MOE-T2EP20121-0007. The work of Eitan Yaakobi was supported in part by the Israel Innovation Authority under Grant 75855 and in part by the Technion Data Science Initiative
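The single-deletion setting with two reads can be illustrated as follows. This is only a sketch under stated assumptions, not the reconstruction code of Cai et al. or of this work. It computes the standard first-order Varshamov-Tenengolts (VT) syndrome and lists the candidate words consistent with two reads by intersecting their single-insertion supersequences. All function names are hypothetical.

```python
def vt_syndrome(x, modulus=None):
    """First-order VT syndrome: sum of i * x_i (1-indexed), modulo n + 1 by default."""
    m = modulus if modulus is not None else len(x) + 1
    return sum(i * bit for i, bit in enumerate(x, start=1)) % m


def single_deletion_reads(x):
    """All distinct words obtained from tuple x by deleting exactly one symbol."""
    return {x[:i] + x[i + 1:] for i in range(len(x))}


def single_insertion_supersequences(y, alphabet=(0, 1)):
    """All words obtained from tuple y by inserting exactly one symbol."""
    return {y[:i] + (a,) + y[i:] for i in range(len(y) + 1) for a in alphabet}


def candidates_from_two_reads(y1, y2):
    """Words of length len(y1) + 1 that could have produced both reads."""
    return single_insertion_supersequences(y1) & single_insertion_supersequences(y2)


if __name__ == "__main__":
    x = (0, 1, 1, 0, 1)
    y1, y2 = (1, 1, 0, 1), (0, 1, 1, 0)  # two distinct single-deletion reads of x
    print(vt_syndrome(x))
    print(candidates_from_two_reads(y1, y2))
```

When the candidate set contains more than one word, the code's redundancy has to tell the candidates apart; the abstract's result concerns how few redundant bits suffice for that.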

    On the number of DNA sequence profiles for practical values of read lengths

    No full text
    A recent study by one of the authors has demonstrated the relevance of profile vectors in DNA-based data storage. We provide exact values and lower bounds on the number of profile vectors for finite values of alphabet size q, read length ℓ, and word length n. Consequently, we demonstrate that for q ≥ 3 and n = q a ℓ, a = o(ℓ), the number of profile vectors is at least $q^{\kappa n}$ for some constant $0 < \kappa \le 1$. In addition to enumeration results, we provide a set of efficient encoding and decoding algorithms for a family of profile vectors. MOE (Min. of Education, S’pore). Accepted versio

    Rates of DNA sequence profiles for practical values of read lengths

    No full text
    A recent study by one of the authors has demonstrated the importance of profile vectors in DNA-based data storage. We provide exact values and lower bounds on the number of profile vectors for finite values of alphabet size $q$, read length $\ell$, and word length $n$. Consequently, we demonstrate that for $q \ge 2$ and $n \le q^{\ell/2-1}$, the number of profile vectors is at least $q^{\kappa n}$ with $\kappa$ very close to 1. In addition to enumeration results, we provide a set of efficient encoding and decoding algorithms for certain families of profile vectors. Accepted versio

    Efficient encoding/decoding of irreducible words for codes correcting tandem duplications

    No full text
    Tandem duplication is the process of inserting a copy of a segment of DNA adjacent to the original position. Motivated by applications that store data in living organisms, Jain et al. (2017) proposed the study of codes that correct tandem duplications. All code constructions are based on irreducible words. We study efficient encoding/decoding methods for irreducible words. First, we describe an $(\ell, m)$-finite state encoder and show that when $m = \Theta(1/\epsilon)$ and $\ell = \Theta(1/\epsilon)$, the encoder has rate that is $\epsilon$ away from the optimal. Next, we provide ranking/unranking algorithms for irreducible words and modify the algorithms to reduce the space requirements for the finite state encoder. MOE (Min. of Education, S’pore). Accepted versio
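The tandem duplication operation described above can be sketched directly. The irreducibility check assumes one common convention, which the abstract does not spell out: a word is irreducible with respect to duplications of length at most `max_len` when it contains no substring of the form uu with |u| ≤ `max_len`. The function names and the `max_len` parameter are hypothetical.

```python
def tandem_duplicate(word, start, length):
    """Insert a copy of word[start:start+length] immediately after that segment."""
    if length <= 0 or start < 0 or start + length > len(word):
        raise ValueError("segment must lie inside the word")
    segment = word[start:start + length]
    return word[:start + length] + segment + word[start + length:]


def has_tandem_repeat(word, max_len):
    """True if word contains a substring uu with 1 <= len(u) <= max_len."""
    n = len(word)
    for k in range(1, max_len + 1):
        for i in range(n - 2 * k + 1):
            if word[i:i + k] == word[i + k:i + 2 * k]:
                return True
    return False


def is_irreducible(word, max_len):
    """Assumed convention: irreducible means no tandem repeat of length <= max_len."""
    return not has_tandem_repeat(word, max_len)


if __name__ == "__main__":
    original = "ACGT"
    duplicated = tandem_duplicate(original, 1, 2)
    print(duplicated, is_irreducible(original, 3), is_irreducible(duplicated, 3))
```

Under this convention, every tandem duplication of a segment of length at most `max_len` turns an irreducible word into a reducible one, which is what lets irreducible words serve as the basis of the code constructions.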